Fix data loss when the old primary takes over the slots after online #974

enjoy-binbin · 2024-08-31T11:32:07Z

There is a race in clusterHandleConfigEpochCollision, which may cause
the old primary node to take over the slots again after coming online
and cause data loss. It happens when the old primary and the new primary
have the same config epoch, and the old primary has a smaller node id
and win the collision.

In this case, the old primary and the new primary are in the same shard,
we are not sure which is strictly the latest. To prevent data loss,
now in clusterHandleConfigEpochCollision we will let the node with the
larger offset win the conflict.

In addition to this change, when a node increments the config epoch
throught conflicts, or CLUSTER FAILOVER TAKEOVER, or CLUSTER BUMPEPOCH,
we will send PONGs to all ndoes to allow the cluster to reach consensus
on the new config epoch more quickly.

This also can closes #969.

There is a race in clusterHandleConfigEpochCollision, which may cause the old primary node to take over the slots again after coming online and cause data loss. It happens when the old primary and the new primary have the same config epoch, and the old primary has a smaller node id and win the collision. In this case, the old primary and the new primary are in the same shard, we are not sure which is strictly the latest. To prevent data loss, now in clusterHandleConfigEpochCollision we will let the node with the larger offset win the conflict. In addition to this change, when a node increments the config epoch throught conflicts, or CLUSTER FAILOVER TAKEOVER, or CLUSTER BUMPEPOCH, we will send PONGs to all ndoes to allow the cluster to reach consensus on the new config epoch more quickly. This also can closes valkey-io#969. Signed-off-by: Binbin <[email protected]> Signed-off-by: Binbin <[email protected]>

enjoy-binbin · 2024-08-31T11:39:10Z

Here is the logs:

The old primary:

# The old primary, restarting
2024-08-30T05:51:13.1503179Z ### Starting test Restarting the previously killed master nodes in tests/unit/cluster/manual-takeover.tcl

# Get configEpoch collision with other primary (including the new primary) and the old primary won the collision.
2024-08-30T05:51:13.1932876Z 14312:M 30 Aug 2024 05:50:20.994 * Connection with replica 127.0.0.1:28624 lost.
2024-08-30T05:51:13.1934608Z 14312:M 30 Aug 2024 05:50:20.994 * FAIL message received from 4260b224913d8b1d0dcf27a909983ac412d171f0 () about 77296d415c4b47c081ef5c99a1720061717e06de ()
2024-08-30T05:51:13.1936882Z 14312:M 30 Aug 2024 05:50:20.994 * FAIL message received from 4260b224913d8b1d0dcf27a909983ac412d171f0 () about c0bc5fd54657c5b445405d4ac18b84345959fbcc ()
2024-08-30T05:51:13.1939066Z 14312:M 30 Aug 2024 05:50:21.000 * configEpoch collision with node 6bb7d855c2b596cab135cbabcf1903ea20d2699a (). configEpoch set to 8
2024-08-30T05:51:13.1946224Z 14312:M 30 Aug 2024 05:50:21.003 * configEpoch collision with node 8de0417aba0b87c1e0cfa605766d8e2a9a7d41d4 (). configEpoch set to 9

# Sending a UPDATE, and make the new primary become a replica.
2024-08-30T05:51:13.1982806Z 14312:M 30 Aug 2024 05:50:21.004 - Node 8de0417aba0b87c1e0cfa605766d8e2a9a7d41d4 has old slots configuration, sending an UPDATE message about 30e8a59b00dfefddb91659aa15cc4d24da0c00f7
2024-08-30T05:51:13.1987920Z 14312:M 30 Aug 2024 05:50:21.004 - Client closed connection id=10 addr=127.0.0.1:43115 laddr=127.0.0.1:28629 fd=37 name= age=0 idle=0 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=20474 argv-mem=0 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=37760 events=r cmd=NULL user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=0 tot-net-out=0 tot-cmds=0
2024-08-30T05:51:13.2031232Z 14312:M 30 Aug 2024 05:50:21.007 - Node 8de0417aba0b87c1e0cfa605766d8e2a9a7d41d4 has old slots configuration, sending an UPDATE message about 30e8a59b00dfefddb91659aa15cc4d24da0c00f7
2024-08-30T05:51:13.2033814Z 14312:M 30 Aug 2024 05:50:21.007 * Node 15a7b11e4dbfbefe354a41ed39e535694d6c8058 () reported node 77296d415c4b47c081ef5c99a1720061717e06de () is back online.

# The new primary become a replica
2024-08-30T05:51:13.2036446Z 14312:M 30 Aug 2024 05:50:21.014 * Replica 127.0.0.1:28624 asks for synchronization
2024-08-30T05:51:13.2039410Z 14312:M 30 Aug 2024 05:50:21.014 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for 'e3dde8f6addf6deed58fd8f1a91024f399850238', my replication IDs are '480e5eaf42d14b1037c43eb39525c965cfd4d752' and '0000000000000000000000000000000000000000')
2024-08-30T05:51:13.2042187Z 14312:M 30 Aug 2024 05:50:21.014 * Starting BGSAVE for SYNC with target: replicas sockets using: normal sync
2024-08-30T05:51:13.2043751Z 14312:M 30 Aug 2024 05:50:21.015 * Background RDB transfer started by pid 16253 to pipe through parent process
2024-08-30T05:51:13.2045418Z ### Starting test Instance #0, #1, #2 gets converted into a slaves in tests/unit/cluster/manual-takeover.tcl

The new primary (using a failover takeover, and became a replica at the last)

# The new primary (the replica which use the FAILOVER TAKEOVER)
2024-08-30T05:51:14.2595326Z 14142:S 30 Aug 2024 05:50:17.310 * Taking over the primary (user request).
2024-08-30T05:51:14.2596355Z 14142:S 30 Aug 2024 05:50:17.310 * New configEpoch set to 8
2024-08-30T05:51:14.2597130Z 14142:M 30 Aug 2024 05:50:17.310 * Connection with primary lost.
2024-08-30T05:51:14.2612166Z 14142:M 30 Aug 2024 05:50:17.310 * Caching the disconnected primary state.
2024-08-30T05:51:14.2613391Z 14142:M 30 Aug 2024 05:50:17.310 * Discarding previously cached primary state.
2024-08-30T05:51:14.2615593Z 14142:M 30 Aug 2024 05:50:17.310 * Setting secondary replication ID to 480e5eaf42d14b1037c43eb39525c965cfd4d752, valid up to offset: 1354. New replication ID is e3dde8f6addf6deed58fd8f1a91024f399850238

# delete some noise

# The old primary online again
2024-08-30T05:51:14.2800774Z 14142:M 30 Aug 2024 05:50:20.995 - Node 30e8a59b00dfefddb91659aa15cc4d24da0c00f7 has old slots configuration, sending an UPDATE message about 8de0417aba0b87c1e0cfa605766d8e2a9a7d41d4
2024-08-30T05:51:14.2803401Z 14142:M 30 Aug 2024 05:50:20.999 * Node 15a7b11e4dbfbefe354a41ed39e535694d6c8058 () reported node 77296d415c4b47c081ef5c99a1720061717e06de () is back online.
2024-08-30T05:51:14.2805843Z 14142:M 30 Aug 2024 05:50:21.003 * Node 6bb7d855c2b596cab135cbabcf1903ea20d2699a () reported node 77296d415c4b47c081ef5c99a1720061717e06de () is back online.
2024-08-30T05:51:14.2808216Z 14142:M 30 Aug 2024 05:50:21.009 * Clear FAIL state for node 30e8a59b00dfefddb91659aa15cc4d24da0c00f7 (): primary without slots is reachable again.

# The old primary win the collision and the new primary are losing its slot and become a replica.
2024-08-30T05:51:14.2810907Z 14142:M 30 Aug 2024 05:50:21.010 * Configuration change detected. Reconfiguring myself as a replica of node 30e8a59b00dfefddb91659aa15cc4d24da0c00f7 () in shard d5800922644aa5418d1d64a243612a1265c0d042
2024-08-30T05:51:14.2877461Z 14142:S 30 Aug 2024 05:50:21.010 * Before turning into a replica, using my own primary parameters to synthesize a cached primary: I may be able to synchronize with the new primary with just a partial transfer.
2024-08-30T05:51:14.2892822Z 14142:S 30 Aug 2024 05:50:21.010 * Connecting to PRIMARY 127.0.0.1:28629

enjoy-binbin · 2024-08-31T11:40:05Z

tests/unit/cluster/manual-takeover.tcl

@@ -22,8 +22,12 @@ set paused_pid [srv 0 pid]
 set paused_pid1 [srv -1 pid]
 set paused_pid2 [srv -2 pid]
 test "Killing majority of master nodes" {
+    # Bumping the epochs to increase the chance of conflicts.


without the fix, this can lead to increased conflicts, which can cause tests to fail

I think the original issue is less about the non-deterministic epoch resolution behavior we have today but more about the race between the primary election and the forced takeover. I think that would be addressed if we disable the replica failover before pausing the primaries.

The change of bumping the epoch before pausing the primaries changes the intent of the test IMO to validate your "more deterministic conflict resolution" proposal, which does require increasing the conflict probably.

Can we make a surgical fix to the original test issue but create a new file to add the new test? It is a bit confusing to include it in a test file called manual-takeover as conflicts can happen outside manual-takeover too. maybe "cluster-epoch-conflict.tcl" or just "epoch-conflict.tcl"?

i am fail to see why disbale the replica failover will work. Do you mean disable the it on the replica side? The code already doing it. Or do you mean disable it on the primary side?

# For this test, disable replica failover until # all of the primaries are confirmed killed. Otherwise # there might be enough time to elect a replica. set replica_ids { 5 6 7 } foreach id $replica_ids { R $id config set cluster-replica-no-failover yes } set paused_pid [srv 0 pid] set paused_pid1 [srv -1 pid] set paused_pid2 [srv -2 pid] test "Killing majority of master nodes" { pause_process $paused_pid pause_process $paused_pid1 pause_process $paused_pid2 } foreach id $replica_ids { R $id config set cluster-replica-no-failover no }

yes, it is a good idea to split it, i will find some ways to do it.

i am fail to see why disbale the replica failover will work.

I think there is a race condition between turning off failover on the replicas and stopping the primary. I meant to suggest the following changes (reverting the order R $id config set cluster-replica-no-failover no and pause_process $paused_pid). Can you give it a try (without the server code change)?

Suggested change

# Bumping the epochs to increase the chance of conflicts.

foreach id $replica_ids {

R $id config set cluster-replica-no-failover no

}

test "Killing majority of master nodes" {

pause_process $paused_pid

pause_process $paused_pid1

pause_process $paused_pid2

}

What is the specific race? look like you are suggested to remove this one? e81bd15

… and we lost the context Signed-off-by: Binbin <[email protected]>

enjoy-binbin · 2024-08-31T11:43:03Z

src/cluster_legacy.c

    serverLog(LL_NOTICE, "configEpoch collision with node %.40s (%s). configEpoch set to %llu", sender->name,
              sender->human_nodename, (unsigned long long)myself->configEpoch);
+    clusterSaveConfigOrDie(1);


print the logs first and then save the config, in case the save fails and we lost the context

I think this is a good context to keep in the code. add it as code comment?

SGTM, i will add this in the next commit.

codecov · 2024-08-31T11:57:34Z

Codecov Report

Attention: Patch coverage is 50.00000% with 5 lines in your changes missing coverage. Please review.

Project coverage is 70.47%. Comparing base (fea49bc) to head (b395193).

Files with missing lines	Patch %	Lines
src/cluster_legacy.c	50.00%	5 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable     #974      +/-   ##
============================================
- Coverage     70.50%   70.47%   -0.04%     
============================================
  Files           114      114              
  Lines         61742    61750       +8     
============================================
- Hits          43532    43519      -13     
- Misses        18210    18231      +21

Files with missing lines	Coverage Δ
src/cluster_legacy.c	`85.99% <50.00%> (+<0.01%)`	⬆️

... and 10 files with indirect coverage changes

zuiderkwast

I hope this solves the problem! I'm not sure about the theoretical stuff. @PingXie what do you think?

zuiderkwast · 2024-08-31T12:18:42Z

src/cluster_legacy.c

 * with the conflicting epoch (the 'sender' node), it will assign itself
 * the greatest configuration epoch currently detected among nodes plus 1.
 *
+ * The above is an optimistic scenario. It this node and the sender node
+ * are in the same shard, their conflict in configEpoch indicates that a
+ * node has experienced a partition. Or for example, the old primary node
+ * was down then up again, and the new primary node won the election. In
+ * this case, we need to take the replication offset into consideration,
+ * otherwise, if the old primary wins the collision, we will lose some of
+ * the new primary's data.
+ *
 * This means that even if there are multiple nodes colliding, the node
 * with the greatest Node ID never moves forward, so eventually all the nodes
 * end with a different configuration epoch.


It's better to describe this as if same shard ... otherwise ... like you did in another code comment below:

/* If sender and myself are in the same shard, the one with the * bigger offset will win. Otherwise if sender and myself are not * in the same shard, the one will the lexicographically small * Node ID will win.*/

And the text below the added text describes the guarantee "the node with the greatest Node ID never moves forward". We need to change this text.

Do we have another guarantees in the same-shard scenario? Can it happen that sender and myself don't have the same view on the replication offsets and both try to bump the epoch?

And what if one node thinks that it is in the same shard but the other one has a different idea and tries to compare by node-id? Can it happen that both try to bump the epoch?

some thoughts:

shard-id is mostly stable except in two situations:

case 1. The delayed propagation of shard-id, as captured in #778 and #573. I am still trying to wrap my head around all the possible permutations.

case 2. cluster-allow-replica-migration when enabled would allow a replica or primary to join another shard if their shard lost their last slot to the other shard. BTW, I don't get the use case of this setting and even more why it is enabled by default. This setting created quite a few surprises IMO. We disable it on GCP Memorystore.

In both cases, we could have a split "are_in_the_same_shard" view on the nodes involved.

case 1. The split view could happen such that node A correctly concludes that it is in the same shard as B but B, because it hasn't received A real shard_id, thinks otherwise and relies on the node-id only to resolve the conflict. When this happens, the conflict continues. Eventually, B will receive A's real shard_id and should arrive at the same conclusion as A. Data written to the losing node during this transitional period will get lost when the conflict is eventually resolved.

case 2. The split could happen in both the source and target shards but I think the end result is innocuous in the source shard since it will be empty by then. The impact on the target shard should be mitigated by #885

We will always lose data (except for the trivial case of no user writes) even if we pick the larger offset of the two. The replication history diverges the moment the "split brain" occurs even the two still share the same repl_id, which is what leads to the epoch conflict. By favoring the larger offset, we are saying the one with more user writes wins. I think this is a reasonable decision but just wanted to clarify that the two no longer share the same replication history despite having the same repl_id.

btw, @madolson, this is one of the many places where I fully agree with you a new clustering solution that is designed holistically would fare better.

PingXie

overall I think this is a good fix but I feel that we can address #969 by disabling replica votes before pausing the primaries. @enjoy-binbin can you give that a shot? If it works, I would suggest moving this fix out of 8.0.

PingXie · 2024-09-01T07:41:18Z

src/cluster_legacy.c

    serverLog(LL_NOTICE, "configEpoch collision with node %.40s (%s). configEpoch set to %llu", sender->name,
              sender->human_nodename, (unsigned long long)myself->configEpoch);
+    clusterSaveConfigOrDie(1);


I think this is a good context to keep in the code. add it as code comment?

PingXie · 2024-09-01T07:58:10Z

tests/unit/cluster/manual-takeover.tcl

@@ -22,8 +22,12 @@ set paused_pid [srv 0 pid]
 set paused_pid1 [srv -1 pid]
 set paused_pid2 [srv -2 pid]
 test "Killing majority of master nodes" {
+    # Bumping the epochs to increase the chance of conflicts.


I think the original issue is less about the non-deterministic epoch resolution behavior we have today but more about the race between the primary election and the forced takeover. I think that would be addressed if we disable the replica failover before pausing the primaries.

The change of bumping the epoch before pausing the primaries changes the intent of the test IMO to validate your "more deterministic conflict resolution" proposal, which does require increasing the conflict probably.

Can we make a surgical fix to the original test issue but create a new file to add the new test? It is a bit confusing to include it in a test file called manual-takeover as conflicts can happen outside manual-takeover too. maybe "cluster-epoch-conflict.tcl" or just "epoch-conflict.tcl"?

PingXie · 2024-09-01T08:39:46Z

src/cluster_legacy.c

 * with the conflicting epoch (the 'sender' node), it will assign itself
 * the greatest configuration epoch currently detected among nodes plus 1.
 *
+ * The above is an optimistic scenario. It this node and the sender node
+ * are in the same shard, their conflict in configEpoch indicates that a
+ * node has experienced a partition. Or for example, the old primary node
+ * was down then up again, and the new primary node won the election. In
+ * this case, we need to take the replication offset into consideration,
+ * otherwise, if the old primary wins the collision, we will lose some of
+ * the new primary's data.
+ *
 * This means that even if there are multiple nodes colliding, the node
 * with the greatest Node ID never moves forward, so eventually all the nodes
 * end with a different configuration epoch.


some thoughts:

shard-id is mostly stable except in two situations:

case 1. The delayed propagation of shard-id, as captured in #778 and #573. I am still trying to wrap my head around all the possible permutations.

case 2. cluster-allow-replica-migration when enabled would allow a replica or primary to join another shard if their shard lost their last slot to the other shard. BTW, I don't get the use case of this setting and even more why it is enabled by default. This setting created quite a few surprises IMO. We disable it on GCP Memorystore.

In both cases, we could have a split "are_in_the_same_shard" view on the nodes involved.

case 1. The split view could happen such that node A correctly concludes that it is in the same shard as B but B, because it hasn't received A real shard_id, thinks otherwise and relies on the node-id only to resolve the conflict. When this happens, the conflict continues. Eventually, B will receive A's real shard_id and should arrive at the same conclusion as A. Data written to the losing node during this transitional period will get lost when the conflict is eventually resolved.

case 2. The split could happen in both the source and target shards but I think the end result is innocuous in the source shard since it will be empty by then. The impact on the target shard should be mitigated by #885

We will always lose data (except for the trivial case of no user writes) even if we pick the larger offset of the two. The replication history diverges the moment the "split brain" occurs even the two still share the same repl_id, which is what leads to the epoch conflict. By favoring the larger offset, we are saying the one with more user writes wins. I think this is a reasonable decision but just wanted to clarify that the two no longer share the same replication history despite having the same repl_id.

PingXie · 2024-09-01T08:42:09Z

src/cluster_legacy.c

 * with the conflicting epoch (the 'sender' node), it will assign itself
 * the greatest configuration epoch currently detected among nodes plus 1.
 *
+ * The above is an optimistic scenario. It this node and the sender node
+ * are in the same shard, their conflict in configEpoch indicates that a
+ * node has experienced a partition. Or for example, the old primary node
+ * was down then up again, and the new primary node won the election. In
+ * this case, we need to take the replication offset into consideration,
+ * otherwise, if the old primary wins the collision, we will lose some of
+ * the new primary's data.
+ *
 * This means that even if there are multiple nodes colliding, the node
 * with the greatest Node ID never moves forward, so eventually all the nodes
 * end with a different configuration epoch.


btw, @madolson, this is one of the many places where I fully agree with you a new clustering solution that is designed holistically would fare better.

madolson · 2024-09-02T01:41:11Z

src/cluster_legacy.c

    /* Get the next ID available at the best of this node knowledge. */
    server.cluster->currentEpoch++;
    myself->configEpoch = server.cluster->currentEpoch;
-    clusterSaveConfigOrDie(1);
+    clusterBroadcastPong(CLUSTER_BROADCAST_ALL);


Can we document why there is a broadcast here as well. If there is any type of significant conflicts we'll send out a lot of messages unnecessary.

yes, we can document it. i guess in normal cases, conflicts should be rare.

I think i am doing this here is also because, if the primary has a larger epoch, and its replicas don't know it in time, the replica may not be able to win the election (it has an old epoch and will get rejected in this case, but rare just my guess)

madolson · 2024-09-02T01:46:39Z

src/cluster_legacy.c

+     * bigger offset will win. Otherwise if sender and myself are not
+     * in the same shard, the one will the lexicographically small
+     * Node ID will win.*/
+    if (areInSameShard(sender, myself)) {


I suppose I don't get why this works. If we both nodes think they are primaries, won't both of our offsets be the replication offsets we are now using could still be the same, and both nodes would both bump. The previous logic was consistent between nodes, whereas this logic may not and they may fight? It feels like this is just a race between the manual failover and the automated failover, and I'm not sure we necessarily should automatically be picking a winner.

if the offset is the same, yes, it will still have the same issue, in this case, if the offset is the same, we will use the origin node id logic to pick a winner, in this case, the logic is the same as before.

i guess in here the only problem i want to solve is: the old primary paused or block (offset is small), and the new primary was elected and did some writes (offset is large). They encountered a conflict and the old primary win, got a new config epoch and forced the new primary to become a replica again.

enjoy-binbin commented Aug 31, 2024

View reviewed changes

enjoy-binbin requested review from zuiderkwast, PingXie and madolson August 31, 2024 11:40

print the logs first and then save the config, in case the save fails…

b395193

… and we lost the context Signed-off-by: Binbin <[email protected]>

enjoy-binbin commented Aug 31, 2024

View reviewed changes

zuiderkwast reviewed Aug 31, 2024

View reviewed changes

PingXie reviewed Sep 1, 2024

View reviewed changes

madolson reviewed Sep 2, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix data loss when the old primary takes over the slots after online #974

Fix data loss when the old primary takes over the slots after online #974

enjoy-binbin commented Aug 31, 2024

enjoy-binbin commented Aug 31, 2024

enjoy-binbin Aug 31, 2024

PingXie Sep 1, 2024

enjoy-binbin Sep 1, 2024

PingXie Sep 1, 2024 •

edited

Loading

enjoy-binbin Sep 2, 2024

enjoy-binbin Aug 31, 2024

PingXie Sep 1, 2024

enjoy-binbin Sep 1, 2024

codecov bot commented Aug 31, 2024

zuiderkwast left a comment

zuiderkwast Aug 31, 2024

PingXie Sep 1, 2024

PingXie Sep 1, 2024

PingXie left a comment

PingXie Sep 1, 2024

PingXie Sep 1, 2024

PingXie Sep 1, 2024

PingXie Sep 1, 2024

madolson Sep 2, 2024

enjoy-binbin Sep 2, 2024

madolson Sep 2, 2024

enjoy-binbin Sep 2, 2024

-    # Bumping the epochs to increase the chance of conflicts.
+    foreach id $replica_ids {
+        R $id config set cluster-replica-no-failover no
+    }
+    test "Killing majority of master nodes" {
+        pause_process $paused_pid
+        pause_process $paused_pid1
+        pause_process $paused_pid2
+   }

Fix data loss when the old primary takes over the slots after online #974

Are you sure you want to change the base?

Fix data loss when the old primary takes over the slots after online #974

Conversation

enjoy-binbin commented Aug 31, 2024

enjoy-binbin commented Aug 31, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

PingXie Sep 1, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov bot commented Aug 31, 2024

Codecov Report

zuiderkwast left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

PingXie left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

PingXie Sep 1, 2024 •

edited

Loading